Goals:

  1. Explore the structure of data
  2. Assess the quality of the data
  3. Identify metabolites that are different between cancer types or between cell lines with different drug responses.

Requirements:

  1. R or Rstudio
  2. Packages:
    • ggplot2 (to install, type “install.packages(”ggplot2“)” in R)
    • robcor (to install, type “install.packages(”robcor“) in R”)

Input Data:

Users can load an Rdata frame which contains three R objects:

  1. finalmetab: a matrix containing metabolite abundance values.
  2. finalgenes: a matrix containing gene abundance values
  3. finannot: annotation (meta-data) of the samples where the column “cell_line” matches that of the metab matrix.

You can also find all the data in an Excel sheet.

Tips:

1. Explore the data types present in “CleanGeneMetab_NCI_60.Rdata”

In R or Rstudio, load your data (make sure to change yoru directory appropriately to your working directory):

load("/Users/math90/Documents/Teaching/Ethiopia/OHSI_DataAnalytics_Aug2019/Data/NCI60/CleanGeneMetab_NCI_60.Rdata")

Here are some key functions to explore the types of data you have.

First, look to see what objects you have loaded in your working RStudio environment:

ls()
## [1] "finalgenes" "finalmetab" "finannot"   "smallmetab"

You should see 3 objects: finalgenes, finalmetab, finannot.

Now look to see how big these are:

dim(finalgenes)
## [1]    57 17987
dim(finalmetab)
## [1]  57 349
dim(finannot)
## [1] 57  4

You can use the function View() to look at your data bt since they are large, let’s use functions such as dim() (dimentions), nrow() (number of rows), ncol() (number of columnes), rownames() (names of rows), and colnames() (names of columns):

dim(finalgenes)
## [1]    57 17987
nrow(finalgenes)
## [1] 57
ncol(finalgenes)
## [1] 17987
head(rownames(finalgenes))
## [1] "786-0"     "A498"      "A549/ATCC" "ACHN"      "BT-549"    "CAKI-1"
head(colnames(finalgenes))
## [1] "PRPF8"  "CAPNS1" "RPL35"  "EIF3D"  "PARK7"  "SRP14"

Discussion: What do you think these objects represent?

Let’s take a closer look at the object finannot:

head(finannot)
##   cell_line   drugscore   drugcateg cancertype
## 1     786-0  0.05153939 No Response      Renal
## 2      A498 -0.45286203   Resistant      Renal
## 3 A549/ATCC -0.41449565   Resistant      NSCLC
## 4      ACHN -0.03506064 No Response      Renal
## 5    BT-549 -0.34282314   Resistant     Breast
## 6    CAKI-1 -0.03703719 No Response      Renal

How many types of cell lines are there? How many types of cancers represented?

table(finannot$cell_line)
## 
##           786-0            A498       A549/ATCC            ACHN 
##               1               1               1               1 
##          BT-549          CAKI-1        CCRF-CEM        COLO 205 
##               1               1               1               1 
##          DU-145            EKVX        HCC-2998         HCT-116 
##               1               1               1               1 
##          HCT-15       HL-60(TB)          HOP-62          HOP-92 
##               1               1               1               1 
##         HS 578T            HT29          IGROV1           K-562 
##               1               1               1               1 
##            KM12        LOX IMVI             M14        MALME-3M 
##               1               1               1               1 
##            MCF7 MDA-MB-231/ATCC      MDA-MB-435          MOLT-4 
##               1               1               1               1 
##        NCI-H226         NCI-H23       NCI-H322M        NCI-H460 
##               1               1               1               1 
##        NCI-H522     NCI/ADR-RES         OVCAR-3         OVCAR-4 
##               1               1               1               1 
##         OVCAR-5         OVCAR-8            PC-3       RPMI-8226 
##               1               1               1               1 
##          SF-268          SF-295          SF-539       SK-MEL-28 
##               1               1               1               1 
##        SK-MEL-5         SK-OV-3           SN12C          SNB-19 
##               1               1               1               1 
##          SNB-75              SR          SW-620           T-47D 
##               1               1               1               1 
##           TK-10            U251        UACC-257         UACC-62 
##               1               1               1               1 
##           UO-31 
##               1
length(unique(finannot$cell_line))
## [1] 57
table(finannot$cancertype)
## 
##   Breast      CNS    Colon Leukemia Melanoma    NSCLC  Ovarian Prostate 
##        5        6        7        6        8        9        7        2 
##    Renal 
##        7
barplot(table(finannot$cancertype), main = "Cancer Types in NCI-60 Data", names.arg = unique(finannot$cancertype), 
    ylab = "Amount")

This barplot is not pretty, let’s try again and relabel the x-axis:

a = barplot(table(finannot$cancertype), main = "Cancer Types in NCI-60 Data", 
    names.arg = "", horiz = F, ylab = "Amount", xpd = T, ylim = c(-1.5, 10))
text(a, -1, label = unique(finannot$cancertype), srt = 45, offset = 0, xpd = TRUE)

We could also order the cell lines by the numbr of different cancer types:

toplot <- sort(table(finannot$cancertype))
labels <- unique(finannot$cancertype)[order(table(finannot$cancertype))]
a = barplot(toplot, main = "Cancer Types in NCI-60 Data", names.arg = "", horiz = F, 
    ylab = "Amount", xpd = T, ylim = c(-1.5, 10))
text(a, -1, label = labels, srt = 45, offset = 0, xpd = TRUE)

We can now visualize our data that are continous:

hist(finannot$drugscore, main = "Distribution of Drug Score Data", xlab = "Drug Score")

While this gives us a good idea of how the drug scores are distributed, we could also order the drug score data, plot the actual values, and see how those values correspond to the categorical data.

toplot <- sort(finannot$drugscore)
mycol <- finannot$drugcateg[order(finannot$drugscore)]
barplot(toplot, col = mycol)

We can also change the colors

mycol <- gsub("Resistant", "violet", mycol)
mycol <- gsub("No Response", "grey", mycol)
mycol <- gsub("Sensitive", "green", mycol)
barplot(toplot, col = mycol)

And we should also add a legend and labels:

a = barplot(toplot, col = mycol, main = "Drug sensitivity of cell lines", legend = c("Resistant", 
    "No Response", "Sensitive"), args.legend = list(x = "bottomright", fill = c("violet", 
    "grey", "green")))

alllabels <- finannot$cell_line[order(finannot$drugscore)]
toplot <- sort(finannot$drugscore)
toptext <- as.matrix(a[which(toplot < 0), ])
toplabel <- alllabels[which(toplot < 0)]
bottomtext <- as.matrix(a[which(toplot > 0), ])
bottomlabel <- alllabels[which(toplot > 0)]

a = barplot(toplot, col = mycol, main = "Drug sensitivity of cell lines", legend = c("Resistant", 
    "No Response", "Sensitive"), args.legend = list(x = "bottomright", fill = c("violet", 
    "grey", "green")))

text(toptext, 0.05, label = toplabel, srt = 90, offset = 0, pos = 4, cex = 0.5, 
    adj = 0)
text(bottomtext, -0.05, label = bottomlabel, srt = 90, offset = 0, pos = 2, 
    cex = 0.5)

2. Explore the structure of the data

Oftentimes, datasets have missing values. With this particular dataset, misinv values are imputed by the minimum value.One quick way to check for missing values, is to count the number of times the minumum abundance value appears for a given metabolite. One could also calculate the standard deviation and perhaps filter out metabolites with low variation. Let’s do the latter here. (If you have time on your hands, do both!)

Let’s take a look at the number of missing values here per sample

minvals <- c()
for (i in 1:nrow(finalmetab)) {
    minvals <- c(minvals, which(finalmetab[i, ] == min(finalmetab[i, ])))
}
hist(minvals)

paste("There are a total of", ncol(finalmetab), "in my dataset.")
## [1] "There are a total of 349 in my dataset."

Let’s calcualte the standard deviations per sample:

sds = as.numeric(apply(finalmetab, 2, sd))

# Now plot the distribution of standard deviations:
hist(sds, breaks = 1000, main = "Distribution of metabolite abundance sds")

Do you notice any outliers? Let’s zoom in a little:

hist(sds, breaks = 1000, main = "Distribution of metabolite abundance sds", 
    xlim = c(0, 10))

Do you notice more outliers? What metabolites could you remove from further analysis? How many are there? What metabolite has a very high standard deviation?

bads = c(which(sds == 0), which(sds > 6))
length(bads)
## [1] 3
hist(sds[-bads])

colnames(finalmetab)[which(sds >= 10)]
## character(0)

Save original and filter out metabolites

metab = finalmetab
finalmetab = metab[, -bads]
dim(finalmetab)
## [1]  57 346
dim(metab)
## [1]  57 349

2. Assess the quality of the data

Look at the distsribution of metabolite abundances across each sample using a simple boxplot**:

boxplot(as.data.frame(t(metab)), pch = 19)

Does the data need to be transformed? Try this:

mycol = c(rep("slategrey", 5), rep("seagreen4", 6), rep("lightskyblue", 7), 
    rep("blue", 6), rep("lightcoral", 8), rep("indianred4", 9), rep("purple", 
        7), rep("green", 7), rep("orange", 2))

boxplot(as.data.frame(t(log2(finalmetab))), pch = 19, main = "Distribution of Metabolites Per Sample", 
    names = F, col = mycol, ylab = "log2(Normalized metabolite abundances)", 
    ylim = c(-10, 10))

text(x = seq_along(rownames(finalmetab)), y = par("usr")[3] - 0.5, srt = 45, 
    adj = 1, labels = rownames(finalmetab), xpd = TRUE, cex = 0.6)
legend(3, 10, legend = unique(finannot$cancertype), fill = unique(mycol))

Apply an unsupervised clustering method to see how samples cluster together.
For example, using PCA:

library(ggplot2)
mypca = prcomp(log2(finalmetab), center = T, scale = T)
percvar = round((mypca$sdev)^2/sum(mypca$sdev^2) * 100, 2)
# Check that order of mypca matrix is same as sample matrix:
all.equal(rownames(mypca$x), rownames(finalmetab))
## [1] TRUE
mydf = data.frame(PC1 = mypca$x[, "PC1"], PC2 = mypca$x[, "PC2"], Cell_Line = finannot$cell_line, 
    Cancer_Type = finannot$cancertype)

ggplot(mydf, aes(PC1, PC2, color = Cancer_Type)) + geom_point(size = 7) + scale_color_manual(values = unique(mycol)) + 
    xlab(paste0("PC1: ", percvar[1], "% variance")) + ylab(paste0("PC2: ", percvar[2], 
    "% variance")) + theme_bw() + ggtitle("Metabolomics PCA Plot \n log2(normalized values)") + 
    theme(axis.line = element_line(colour = "black"), axis.title = element_text(size = 15, 
        face = "bold"), plot.title = element_text(size = 20, face = "bold"), 
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(), 
        panel.background = element_blank(), legend.key = element_blank())

Try setting center=F and/or scale=F to see what happens.

3. Identify metabolites that are different between cancer types/cell lines

First, do a t-test to compare breast and leukemia groups. First, you’ll need to refilter by standard deviation, why?

mysamples <- c(which(finannot$cancertype == "Breast"), which(finannot$cancertype == 
    "Leukemia"))
mygroups <- finannot$cancertype[mysamples]
# Remove metabolites with sds == 0
sds1 <- as.numeric(apply(finalmetab[which(finannot$cancertype == "Breast"), 
    ], 2, sd))
sds2 <- as.numeric(apply(finalmetab[which(finannot$cancertype == "Leukemia"), 
    ], 2, sd))
newmetab <- finalmetab[mysamples, intersect(which(sds1 > 0), which(sds2 > 0))]

# Now calculate p-values and log fold changes
pval = c()
for (i in 1:ncol(newmetab)) {
    temp = t.test(as.numeric(log2(newmetab[which(mygroups == "Leukemia"), i])), 
        log2(as.numeric(newmetab[which(mygroups == "Breast"), i])))
    
    pval = c(pval, temp$p.value)
}

One quick way to check if you have any significance, is to look at the distribution of the p-values.*

hist(pval, breaks = 100)
abline(v = 0.01, col = "red")

Can you tell whether there will be significant altered metabolites? How?

# Adjust p-values
pval.adj = p.adjust(pval, method = "fdr")

# Calculate fold changes
fc = c()
for (i in 1:ncol(newmetab)) {
    mean1 = mean(as.numeric(newmetab[which(mygroups == "Leukemia"), i]))
    mean2 = mean(as.numeric(newmetab[which(mygroups == "Breast"), i]))
    fc = c(fc, mean1/mean2)
}
range(pval.adj)
## [1] 0.003375347 0.999698829
range(fc)
## [1] 0.03782323 9.61968228

Draw a volcano plot

plot(log2(fc), -log10(pval.adj), pch = 19, xlab = "log2(FC) - Leukemia/Breast", 
    main = "Volcano Plot\nComparing Leukemia and Breast")
# Draw lines to show adjusted p-value cutoff of 0.05, and fold changes < 0.5
# or > 1.5
abline(h = -log10(0.05), col = "red")
abline(v = c(log2(0.5), log2(1.5)), col = "blue")

# Color significan metabolites
mysigs = intersect(c(which(fc > 1.5), which(fc < 0.5)), which(pval.adj < 0.05))
points(log2(fc[mysigs]), -log10(pval.adj[mysigs]), col = "salmon", pch = 19)

QUESTIONS:

  1. How many metabolites are significant based on your criteria (define your criteria)?
  2. What are the metabolite names with highest abundance in breast cancer cell lines?
  3. What are the metabolite names with highest abundance in leukemia cell lines?

3. Assess correlation between genes and metabolites.

In the manuscript, authors evaluate correlations between genes and metabolites using Pairwise Quadrant Correlations. Let’s give this is a try ourselves and look at the distribution of the results. To do this on all the genes (17,987) and all metabolites (353), would result in 6,349,411. That’s a lot! So let’s only do this for a random set of 100 genes and all 353 metabolites.

library(robcor)

# Set the seed so results are reproducible.
set.seed(1)
mygenes <- sample(1:ncol(finalgenes), 100)

# Calculating pairwise quadrant correlations:
mycorPQC <- matrix(ncol = ncol(finalmetab), nrow = 100)
rownames(mycorPQC) <- colnames(finalgenes)[mygenes]
colnames(mycorPQC) <- colnames(finalmetab)
genelog <- log2(finalgenes)
metablog <- log2(finalmetab)
for (i in 1:length(mygenes)) {
    # print(i) We can use the robcor() function here to use pairwise quadrant
    # correlations
    mycorPQC[i, ] <- as.numeric(apply(metablog, 2, function(x) robcor(x, genelog[, 
        mygenes[i]], method = "quadrant")))
}

# Let's also try to calculate Pearson's correlations to compare:
mycorPear <- matrix(ncol = ncol(finalmetab), nrow = 100)
rownames(mycorPear) <- colnames(finalgenes)[mygenes]
colnames(mycorPear) <- colnames(finalmetab)
for (i in 1:length(mygenes)) {
    # We can use the robcor() function here to use pairwise quadrant
    # correlations
    mycorPear[i, ] <- as.numeric(apply(metablog, 2, function(x) cor(x, genelog[, 
        mygenes[i]], method = "pearson")))
}

range(mycorPQC, na.rm = T)
## [1] -1  1
range(mycorPear, na.rm = T)
## [1] -0.6800078  0.6628377

Now we can plot the distributions of the results:

par(mfrow = c(2, 2))
hist(mycorPQC, main = "Pairwise Quadrant Correlations", breaks = 50)
hist(mycorPear, main = "Pearson's Correlations", breaks = 50)
plot(mycorPQC, mycorPear)

To find out which pairs have the highest correlation (e.g. 1):

head(which(mycorPQC == 1, arr.ind = T))
##              row col
## SLC18B1       43  59
## NKX2-1-AS1    57  59
## CCDC134       63  59
## TANK          98  59
## HDAC4          4  81
## LOC101927204  15  81
rownames(mycorPQC)[43]
## [1] "SLC18B1"
colnames(mycorPQC)[61]
## [1] "X-3074"

We can plot a pair to see:

plot(genelog[, rownames(mycorPQC)[43]], metablog[, colnames(mycorPQC)[61]], 
    xlab = rownames(mycorPQC)[43], ylab = colnames(mycorPQC)[61])

4. Assess drug sensitivity.

The cell lines have been treated with > 20,000 compounds. A score has been calculated for each compound, IC-50, which represents the amount of the drug needed to inhibit biological activity by 50%. The average IC-50 value for each cell line across all drugs was calculated and normalized into a z-score. Based on this average z-score (drug score), cell lines are categorized as “Resistant”, “Sensitive”, or “No Response”.

Let’s take a look at the data:

table(finannot$drugcateg)
## 
## No Response   Resistant   Sensitive 
##          30          16          11
hist(finannot$drugscore, breaks = 50)

range(finannot$drugscore[which(finannot$drugcateg == "Resistant")])
## [1] -1.092312 -0.305111
range(finannot$drugscore[which(finannot$drugcateg == "No Response")])
## [1] -0.2973441  0.1903824
range(finannot$drugscore[which(finannot$drugcateg == "Sensitive")])
## [1] 0.3082282 0.8160741

We could not look at associations between drugs and metabolite or gene levels. Let’s use pairwise quadrant correlations which are more robust than Pearson’s. For example:

library(robcor)

# We can use the robcor() function here to use pairwise quadrant
# correlations
mycorPQC <- as.numeric(apply(finalmetab, 2, function(x) robcor(x, finannot$drugscore, 
    method = "quadrant")))

# Let's also calculate Pearson's to compare:
mycorPear <- as.numeric(apply(finalmetab, 2, function(x) cor(x, finannot$drugscore, 
    method = "pearson")))

range(mycorPQC)
## [1] -1  1
range(mycorPear)
## [1] -0.5016736  0.4419600

Let’s now make some plots to look at these:

par(mfrow = c(2, 2))
hist(mycorPQC, main = "Pairwise Quadrant Correlation")
hist(mycorPear, main = "Pearson Correlation")
plot(mycorPQC, mycorPear, xlab = "PQC", ylab = "Pearson Correlation", pch = 19)

Printout of session info in RStudio as a good housekeeping practice.

sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] robcor_0.1-6  ggplot2_3.2.0
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.2       knitr_1.23       magrittr_1.5     tidyselect_0.2.5
##  [5] munsell_0.5.0    colorspace_1.4-1 R6_2.4.0         rlang_0.4.0     
##  [9] stringr_1.4.0    dplyr_0.8.3      tools_3.6.1      grid_3.6.1      
## [13] gtable_0.3.0     xfun_0.8         withr_2.1.2      htmltools_0.3.6 
## [17] assertthat_0.2.1 yaml_2.2.0       lazyeval_0.2.2   digest_0.6.20   
## [21] tibble_2.1.3     crayon_1.3.4     purrr_0.3.2      formatR_1.7     
## [25] glue_1.3.1       evaluate_0.14    rmarkdown_1.14   labeling_0.3    
## [29] stringi_1.4.3    compiler_3.6.1   pillar_1.4.2     scales_1.0.0    
## [33] pkgconfig_2.0.2